Subword-Based Text Retrieval

نویسندگان

  • Udo Hahn
  • Martin Honeck
  • Stefan Schulz
چکیده

Document retrieval in languages with a rich and complex morphology – particularly in terms of derivation and (single-word) composition – suffers from serious performance degradation with the stemming-only query-term-totext-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, prefixes, suffixes), and subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a large biomedical document collection.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Subword-based approaches for spoken document retrieval

This paper explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spott...

متن کامل

An Investigation of Subword Unit Representations for Spoken Document Retrieval

This study investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recogn...

متن کامل

Open-vocabulary spoken document retrieval based on new subword models and subword phonetic similarity

A new type of video retrieval system is proposed that identifies a target video section by searching for a word passage submitted as a quoted speech or text query. The proposed system has two unique characteristics. The first characteristic is that it is based on subword models such as phonemes, syllables, and morphemes so the system is able to deal with any type of query, including new words a...

متن کامل

Multi-Scale Spoken Document Retrieval for Cantonese Broadcast News

This paper presents the application of a multi-scale paradigm to Chinese spoken document retrieval (SDR) for improving retrieval performance. Multi-scale refers to the use of both words and subwords for retrieval. Words are basic units in a language that carry lexical meaning and subword units (such as phonemes, syllables or characters) are building components for words. Retrieval using subword...

متن کامل

Multi-scale-audio indexing for translingual spoken document retrieval

MEI (Mandarin-English Information) is an English-Chinese crosslingual spoken document retrieval (CL-SDR) system developed during the Johns Hopkins University Summer Workshop 2000. We integrate speech recognition, machine translation, and information retrieval technologies to perform CL-SDR. MEI advocates a multi-scale paradigm, where both Chinese words and subwords (characters and syllables) ar...

متن کامل

Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach for the image retrieval based on the combination of text-based and content-based features. For text-based features, keywords and for content-based features, color and texture features have been used. Query in this system contains some keywords and an input...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003